[VL] Support mapping columns by position index for ORC and Parquet files #10697
rui-mo merged 1 commit into apache:main
Conversation
rui-mo left a comment:
Thanks for the update. Just one nit, and the other change LGTM.
rui-mo left a comment:
Here's what comes to mind. There are three possible strategies for column mapping:
1. Match by position
2. Match by field name
3. Match by unique permanent ID

I suppose Spark only supports (2) and (3) (see facebookincubator/velox#6065 (comment)), while Velox supports (1) and (2). Could you clarify which is supported in this PR?
@rui-mo This exposes Velox's support for (1).

@kevinwilfong I wonder whether Spark supports matching by position index; I assumed it only supports matching by the file ID or column name. Please correct me if that's wrong.
Vanilla Spark partially supports it (https://issues.apache.org/jira/browse/SPARK-32864) and can be extended to do so by customizing readers (which is what we do).

cc @zhztheplayer If there are no further comments, we can proceed to merge this PR.
@rui-mo Please proceed. Thank you very much for the review.
@kevinwilfong I encountered the issue too. I picked this PR up into my Gluten build, but the problem still exists. Querying this table with Spark SQL produces the output shown below, and I set the configs shown below, but nothing helped. Did I miss something?
@beliefer That's strange. Do you get the same results if you don't set spark.gluten.sql.columnar.backend.velox.orcUseColumnNames to false? Nothing in your repro changes the schema of your data, so it should work regardless of the value of that config.
@kevinwilfong Yes, the results are the same no matter the value of that config.
In that case, as mentioned in the linked issue, I think it's not related to this change.
What changes are proposed in this pull request?
In our data warehouse we support schema evolution by column index rather than by name. For example, if a Hive table has schema a, b, c but a partition's files have schema c, a, b, we don't reorder the partition's columns by name: we read partition column c as table column a, partition column a as table column b, and so on.
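The difference between the two mapping strategies can be sketched in a few lines of plain Python (the schemas and row values here are hypothetical examples, not Gluten/Velox code):

```python
# Sketch: mapping file columns to a table schema by position vs. by name.

def map_by_position(table_schema, file_schema, file_row):
    # The i-th file column feeds the i-th table column; names are ignored.
    return {table_col: file_row[file_col]
            for table_col, file_col in zip(table_schema, file_schema)}

def map_by_name(table_schema, file_schema, file_row):
    # Columns are matched by name; their order in the file is irrelevant.
    return {col: file_row[col] for col in table_schema if col in file_row}

table_schema = ["a", "b", "c"]
file_schema = ["c", "a", "b"]          # partition written with a different order
file_row = {"c": 1, "a": 2, "b": 3}    # values as laid out in the file

print(map_by_position(table_schema, file_schema, file_row))  # {'a': 1, 'b': 2, 'c': 3}
print(map_by_name(table_schema, file_schema, file_row))      # {'a': 2, 'b': 3, 'c': 1}
```

With by-position mapping, table column a receives the value of the first file column (c), which is the schema-evolution behavior this PR exposes.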
This is supported in Velox by setting the configs hive.orc.use-column-names and hive.parquet.use-column-names in the HiveConfig to false for ORC and Parquet files respectively; currently both are hard-coded to true in Gluten. This change adds two configs to Gluten's VeloxConfig, spark.gluten.sql.columnar.backend.velox.orcUseColumnNames and spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames, and plumbs them through to the HiveConfig in Velox.
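For illustration, enabling position-based mapping for both formats might look like this when launching a Spark session (a sketch: the config names come from this PR; passing them via --conf is ordinary Spark usage):

```shell
# Tell the Velox backend to map ORC/Parquet columns by position, not by name
spark-sql \
  --conf spark.gluten.sql.columnar.backend.velox.orcUseColumnNames=false \
  --conf spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames=false
```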
In addition, we need to pass the full table schema to the HiveTableHandle, since that is how Velox determines the index of each column. I updated VeloxIteratorApi to set the FileSchema on the LocalFilesNodes it generates when necessary (i.e., when the config is enabled for the file's format), and VeloxPlanConverter/SubstraitToVeloxPlan to propagate it to the HiveTableHandle when present.
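The reason the full table schema is required can be sketched as follows (pure Python, not the actual Velox code; the schemas are hypothetical): when matching by position, a requested column's index in the table schema is what selects the column to read from the file.

```python
# Sketch: resolving which file column backs each requested table column
# when mapping by position. Without the full table schema, the index
# (and therefore the file column) cannot be determined.

def resolve_file_columns(table_schema, file_schema, requested_cols):
    mapping = {}
    for col in requested_cols:
        idx = table_schema.index(col)      # position comes from the table schema
        mapping[col] = file_schema[idx]    # ...and picks the column in the file
    return mapping

table_schema = ["a", "b", "c"]   # full table schema, needed even for a partial read
file_schema = ["x", "y", "z"]    # file written with entirely different column names
print(resolve_file_columns(table_schema, file_schema, ["b"]))  # {'b': 'y'}
```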
Note that I considered setting it on the ReadRel rather than on each LocalFilesNode. However, that reintroduced the problem that we could no longer read from tables containing column types we don't support, even when we don't read those columns, since we would still need to propagate them to the HiveTableHandle. Because partition file formats don't always match the table file format, we don't know whether we need the schema until we generate the splits, at which point it's too late to update the plan. See #10569.
Please note that vanilla Spark partially supports matching by position index (https://issues.apache.org/jira/browse/SPARK-32864) and can be extended to do so by customizing readers.
How was this patch tested?
Added tests for ORC and Parquet files in which the column names in the table don't match the column names in the file, and verified that we can still read them by index when the flags are enabled.